This practical focuses on data wrangling, exploratory data analysis, visualization, and critical comparison with published results using real biomedical data.
We will work with the publication Proteomic and Metabolomic Characterization of COVID-19 Patient Sera
Download and load Supplementary Table 1 (“Additional Demographical and Baseline Characteristics of COVID-19 Patients and Control Groups”) from the supplementary materials of the publication.
Reproduce Table 1 from the manuscript using the supplementary data. Match the reported summary statistics as closely as possible.
Briefly discuss any discrepancies between your results and the published table, and provide possible explanations.
In this first part, we reproduce Table 1 from the manuscript by calculating all the required statistics. The code below starts by loading and preparing the data for the posterior calculations, and then performs all these statistical calculations. Finally, it produces the output table replicating Table 1 in the manuscript, followed by the corresponding discussion. It also has to be noted that everything related to Symptoms, Chest CT, Comordibity, Oxygenation Index, and Treatment was not included in our table, as these data were not in the Supplementary data downloaded.
# Import all the necessary libraries for this exercise and the others
library(readxl)
library(dplyr)
library(ggplot2)
library(tidyr)
library(corrplot)
library(knitr)
library(kableExtra)
library(patchwork)
library(pheatmap)
library(scales)
# --- PREPARE DATA ---
# Load data from excel and treat "/" and empty strings as missing values
df <- read_excel("table_s1.xlsx", sheet = 2, na = c("", "/"))
# Remove the spaces from the column names
names(df) <- gsub(" ", "_", names(df))
names(df) <- gsub("\\(", "", names(df))
names(df) <- gsub("\\)", "", names(df))
# Convert date format and calculate required time differences
df <- df %>%
mutate(
# Convert columns to Date format
Onset_date_f = as.Date(Onset_date_f),
Admission_date = as.Date(Admission_date),
Date_of_progression_to_severe_state = as.Date(Date_of_progression_to_severe_state),
# Calculate time differences in days
time_onset_to_admission = as.numeric(Admission_date - Onset_date_f + 1),
time_admission_to_severe = as.numeric(Date_of_progression_to_severe_state - Admission_date + 1)
)
# Create group labels
df <- df %>%
mutate(
Group = case_when(
Group_d == 0 ~ "healthy",
Group_d == 1 ~ "non_covid",
Group_d == 2 ~ "non_severe",
Group_d == 3 ~ "severe"
)
)
# Create sex labels
df <- df %>%
mutate(Sex_g = factor(Sex_g,
levels = c(0, 1),
labels = c("Female", "Male")))
# Create an expanded dataset, in order to include the groups "Total Covid"
# and "Global" when calculating and plotting statistics later.
# Create "Total Covid" subset
df_covid_total <- df %>%
filter(Group %in% c("non_severe", "severe")) %>%
mutate(Group = "total_covid")
# Create "Global" subset (All of the rows)
df_global <- df %>%
mutate(Group = "Global")
# Bind everything together into expanded dataset and set groups differentiation.
df_expanded <- bind_rows(df, df_covid_total, df_global)
df_expanded$Group <- factor(df_expanded$Group,
levels = c("healthy", "non_covid", "non_severe", "severe", "total_covid", "Global"))
# --- CALCULATE STATISTICS ---
# AGE & BMI — Statistics by group
age_bmi_stats <- df_expanded %>%
group_by(Group) %>%
summarise(
# Age
mean_age = mean(Age_year, na.rm = TRUE),
sd_age = sd(Age_year, na.rm = TRUE),
median_age = median(Age_year, na.rm = TRUE),
Q1_age = quantile(Age_year, 0.25, na.rm = TRUE),
Q3_age = quantile(Age_year, 0.75, na.rm = TRUE),
iqr_age = IQR(Age_year, na.rm = TRUE),
min_age = min(Age_year, na.rm = TRUE),
max_age = max(Age_year, na.rm = TRUE),
# BMI
mean_bmi = mean(BMI_h, na.rm = TRUE),
sd_bmi = sd(BMI_h, na.rm = TRUE),
median_bmi = median(BMI_h, na.rm = TRUE),
Q1_bmi = quantile(BMI_h, 0.25, na.rm = TRUE),
Q3_bmi = quantile(BMI_h, 0.75, na.rm = TRUE),
iqr_bmi = IQR(BMI_h, na.rm = TRUE),
min_bmi = min(BMI_h, na.rm = TRUE),
max_bmi = max(BMI_h, na.rm = TRUE),
.groups = "drop"
)
# Time - Statistics by group
time_stats <- df_expanded %>%
group_by(Group) %>%
summarise(
# Time: Onset to Admission
mean_onset_adm = mean(time_onset_to_admission, na.rm = TRUE),
sd_onset_adm = sd(time_onset_to_admission, na.rm = TRUE),
median_onset_adm = median(time_onset_to_admission, na.rm = TRUE),
Q1_onset_adm = quantile(time_onset_to_admission, 0.25, na.rm = TRUE),
Q3_onset_adm = quantile(time_onset_to_admission, 0.75, na.rm = TRUE),
iqr_onset_adm = IQR(time_onset_to_admission, na.rm = TRUE),
min_onset_adm = if(all(is.na(time_onset_to_admission))) NA else min(time_onset_to_admission, na.rm = TRUE),
max_onset_adm = if(all(is.na(time_onset_to_admission))) NA else max(time_onset_to_admission, na.rm = TRUE),
# Time: Admission to Severe
mean_adm_sev = mean(time_admission_to_severe, na.rm = TRUE),
sd_adm_sev = sd(time_admission_to_severe, na.rm = TRUE),
median_adm_sev = median(time_admission_to_severe, na.rm = TRUE),
Q1_adm_sev = quantile(time_admission_to_severe, 0.25, na.rm = TRUE),
Q3_adm_sev = quantile(time_admission_to_severe, 0.75, na.rm = TRUE),
iqr_adm_sev = IQR(time_admission_to_severe, na.rm = TRUE),
min_adm_sev = if(all(is.na(time_admission_to_severe))) NA else min(time_admission_to_severe, na.rm = TRUE),
max_adm_sev = if(all(is.na(time_admission_to_severe))) NA else max(time_admission_to_severe, na.rm = TRUE),
.groups = "drop"
)
# Sex - Statistics by group
sex_stats <- df_expanded %>%
group_by(Group, Sex_g) %>%
summarise(count = n(), .groups = "drop_last") %>%
mutate(proportion = count / sum(count)) %>%
ungroup()
# --- OUTPUT TABLE ---
# Count the patients in each group and create group lables for the table
group_counts <- table(df_expanded$Group)
lbl_healthy <- paste0("Healthy (N=", group_counts["healthy"], ")")
lbl_non_covid <- paste0("Non-COVID-19 (N=", group_counts["non_covid"], ")")
lbl_total <- paste0("Total COVID-19 (N=", group_counts["total_covid"], ")")
lbl_non_severe <- paste0("COVID-19 Non-severe (N=", group_counts["non_severe"], ")")
lbl_severe <- paste0("COVID-19 Severe (N=", group_counts["severe"], ")")
# Define the Column Order
col_selection <- c(
"Variable",
"Statistic",
"healthy",
"non_covid",
"total_covid",
"non_severe",
"severe"
)
# Sex
sex_rows <- sex_stats %>%
filter(Group %in% c("healthy", "non_covid", "total_covid", "non_severe", "severe")) %>%
mutate(val = sprintf("%d (%.1f%%)", count, proportion * 100)) %>%
select(Group, Sex_g, val) %>%
pivot_wider(names_from = Group, values_from = val) %>%
mutate(
# Force labels to be Male/Female text
Statistic = case_when(
as.character(Sex_g) %in% c("1", "Male") ~ "Male",
as.character(Sex_g) %in% c("0", "Female") ~ "Female",
),
Variable = "Sex - no. (%)"
) %>%
arrange(factor(Statistic, levels = c("Male", "Female")))
# Age
age_rows <- age_bmi_stats %>%
filter(Group %in% c("healthy", "non_covid", "total_covid", "non_severe", "severe")) %>%
mutate(
`Mean ± SD` = sprintf("%.1f ± %.1f", mean_age, sd_age),
`Median (IQR)` = sprintf("%.1f (%.1f-%.1f)", median_age, Q1_age, Q3_age),
`Range` = sprintf("%.1f-%.1f", min_age, max_age)
) %>%
select(Group, `Mean ± SD`, `Median (IQR)`, `Range`) %>%
pivot_longer(cols = -Group, names_to = "Statistic", values_to = "val") %>%
pivot_wider(names_from = Group, values_from = val) %>%
mutate(Variable = "Age - year")
# BMI
bmi_rows <- age_bmi_stats %>%
filter(Group %in% c("healthy", "non_covid", "total_covid", "non_severe", "severe")) %>%
mutate(
`Mean ± SD` = sprintf("%.1f ± %.1f", mean_bmi, sd_bmi),
`Median (IQR)` = sprintf("%.1f (%.1f-%.1f)", median_bmi, Q1_bmi, Q3_bmi),
`Range` = sprintf("%.1f-%.1f", min_bmi, max_bmi)
) %>%
select(Group, `Mean ± SD`, `Median (IQR)`, `Range`) %>%
pivot_longer(cols = -Group, names_to = "Statistic", values_to = "val") %>%
pivot_wider(names_from = Group, values_from = val) %>%
mutate(Variable = "BMI - kg/m2")
# Time 1: Onset to Admission
time1_rows <- time_stats %>%
filter(Group %in% c("healthy", "non_covid", "total_covid", "non_severe", "severe")) %>%
mutate(
`Mean ± SD` = ifelse(is.na(mean_onset_adm), "", sprintf("%.1f ± %.1f", mean_onset_adm, sd_onset_adm)),
`Median (IQR)` = ifelse(is.na(median_onset_adm), "", sprintf("%.1f (%.1f-%.1f)", median_onset_adm, Q1_onset_adm, Q3_onset_adm)),
`Range` = ifelse(is.na(min_onset_adm), "", sprintf("%.1f-%.1f", min_onset_adm, max_onset_adm))
) %>%
select(Group, `Mean ± SD`, `Median (IQR)`, `Range`) %>%
pivot_longer(cols = -Group, names_to = "Statistic", values_to = "val") %>%
pivot_wider(names_from = Group, values_from = val) %>%
mutate(Variable = "Time from Onset to Admission - Days")
# Time 2: Admission to Severe
time2_rows <- time_stats %>%
filter(Group %in% c("healthy", "non_covid", "total_covid", "non_severe", "severe")) %>%
mutate(
`Mean ± SD` = ifelse(is.na(mean_adm_sev), "", sprintf("%.1f ± %.1f", mean_adm_sev, sd_adm_sev)),
`Median (IQR)` = ifelse(is.na(median_adm_sev), "", sprintf("%.1f (%.1f-%.1f)", median_adm_sev, Q1_adm_sev, Q3_adm_sev)),
`Range` = ifelse(is.na(min_adm_sev), "", sprintf("%.1f-%.1f", min_adm_sev, max_adm_sev))
) %>%
select(Group, `Mean ± SD`, `Median (IQR)`, `Range`) %>%
pivot_longer(cols = -Group, names_to = "Statistic", values_to = "val") %>%
pivot_wider(names_from = Group, values_from = val) %>%
mutate(Variable = "Time from Admission to Severe - Days")
# Bind together and print
final_table <- bind_rows(sex_rows, age_rows, bmi_rows, time1_rows, time2_rows) %>%
select(all_of(col_selection)) %>%
rename(
!!lbl_healthy := healthy,
!!lbl_non_covid := non_covid,
!!lbl_total := total_covid,
!!lbl_non_severe := non_severe,
!!lbl_severe := severe
)
kbl(final_table, caption = "Table 1. Demographics and Baseline Characteristics of COVID-19 Patients") %>%
kable_classic(full_width = F, html_font = "Arial") %>%
row_spec(0, bold = TRUE) %>%
column_spec(1, bold = TRUE) %>%
# Add the separation lines
row_spec(2, extra_css = "border-bottom: 1px solid #ddd;") %>%
row_spec(5, extra_css = "border-bottom: 1px solid #ddd;") %>%
row_spec(8, extra_css = "border-bottom: 1px solid #ddd;") %>%
row_spec(11, extra_css = "border-bottom: 1px solid #ddd;") %>%
collapse_rows(columns = 1, valign = "top")
| Variable | Statistic | Healthy (N=28) | Non-COVID-19 (N=25) | Total COVID-19 (N=65) | COVID-19 Non-severe (N=37) | COVID-19 Severe (N=28) |
|---|---|---|---|---|---|---|
| Sex - no. (%) | Male | 21 (75.0%) | 17 (68.0%) | 39 (60.0%) | 23 (62.2%) | 16 (57.1%) |
| Female | 7 (25.0%) | 8 (32.0%) | 26 (40.0%) | 14 (37.8%) | 12 (42.9%) | |
| Age - year | Mean ± SD | 44.4 ± 8.3 | 49.2 ± 14.0 | 48.1 ± 13.9 | 42.9 ± 12.5 | 54.8 ± 12.8 |
| Median (IQR) | 45.0 (38.0-51.0) | 53.0 (38.0-61.0) | 47.0 (37.0-56.0) | 43.0 (36.0-51.0) | 55.0 (47.0-64.2) | |
| Range | 28.0-57.0 | 23.0-67.0 | 18.0-77.0 | 18.0-70.0 | 30.0-77.0 | |
| BMI - kg/m2 | Mean ± SD | 24.4 ± 2.7 | 23.5 ± 2.7 | 24.9 ± 3.0 | 24.5 ± 3.2 | 25.5 ± 2.5 |
| Median (IQR) | 24.2 (22.5-26.1) | 24.7 (21.1-25.4) | 24.7 (22.6-26.9) | 24.2 (21.9-26.7) | 24.9 (24.1-27.1) | |
| Range | 19.9-32.9 | 19.1-27.4 | 18.9-31.3 | 18.9-30.7 | 21.1-31.3 | |
| Time from Onset to Admission - Days | Mean ± SD | 6.0 ± 4.4 | 4.6 ± 3.2 | 7.9 ± 5.2 | ||
| Median (IQR) | 4.0 (3.0-9.0) | 4.0 (3.0-5.0) | 8.5 (3.0-11.0) | |||
| Range | 1.0-24.0 | 1.0-15.0 | 1.0-24.0 | |||
| Time from Admission to Severe - Days | Mean ± SD | 2.4 ± 1.7 | 2.4 ± 1.7 | |||
| Median (IQR) | 2.0 (1.0-3.2) | 2.0 (1.0-3.2) | ||||
| Range | 0.0-7.0 | 0.0-7.0 |
Discussion
When producing the table, the first thing we noticed was that most of the statistics related to dates appeared clearly different than in the manuscript’s Table 1. However, this was solved by adding 1 when computing the corresponding time differences. This +1 was probably added by the authors because when first calculating the time differences, one of the patients resulted in a -1 value. Another possible reason would be that they considered that all patients who had the same date for Onset and Admission or Admission and Severe had spent 1 day there and not 0.
Next, the most noticeable discrepancy appears in the number of Male/Female in the groups Total COVID-19 and COVID-19 Non-severe, where there was a swap of 2 patients. The paper listed them as Male while the data lists them as Female. Appart from this obvious difference, some other statistics differ by a few decimals, or even by a few units in some cases.
These discrepancies appear to have no explanation, as the values obtained in our table are the ones that the supplementary data provides. However, there could be several possible reasons. Firstly, it could have happened that the provided supplementary data was an updated version of the dataset, posterior to the one used in the authors’ analysis, which has slightly different values in some cases, leading, for example, to the shift of 2 patients from Male to Female after the updated correction. On the other hand, it could also be possible that the authors had applied a manual exclusion criteria to some of the data, or different rounding or computational methods in some calcualtions, leading to differences between their table and the Excel’s raw data and table we obtained.
Perform an exploratory data analysis (EDA) of the dataset. At a minimum, include:
For the exploratory data analysis, we produced a summary of the statistics of the data using the “skimr” package, which was used later for the discussion of the data quality and the observed patterns and anomalies. Considering the descriptive statistics, most of them were already computed in Exercise 1.1 and shown in the corresponding table. Therefore, this part is mostly based on the visualization of the data and the subsequent discussion.
# Convert all time-based columns to Date
df <- df %>%
mutate(across(where(~ inherits(.x, "POSIXt")), as.Date))
# Provide a summary of the data with "skimr" package
library(skimr)
skim(df)
| Name | df |
| Number of rows | 118 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| Date | 8 |
| factor | 1 |
| numeric | 16 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Patient_ID_a | 0 | 1.00 | 3 | 6 | 0 | 118 | 0 |
| Proteomics | 0 | 1.00 | 2 | 3 | 0 | 2 | 0 |
| MS_ID_b | 13 | 0.89 | 4 | 7 | 0 | 105 | 0 |
| MSRep_ID_c | 112 | 0.05 | 7 | 7 | 0 | 6 | 0 |
| COVID-19_patients_corhort | 58 | 0.51 | 11 | 13 | 0 | 3 | 0 |
| Metabolomics_analysis | 0 | 1.00 | 2 | 3 | 0 | 2 | 0 |
| Metabolomics_ID_e | 3 | 0.97 | 3 | 6 | 0 | 115 | 0 |
| CRP_i,_mg/L | 32 | 0.73 | 1 | 18 | 0 | 73 | 0 |
| Group | 0 | 1.00 | 6 | 10 | 0 | 4 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Onset_date_f | 53 | 0.55 | 2020-01-09 | 2020-02-14 | 2020-01-26 | 21 |
| Admission_date | 28 | 0.76 | 2020-01-19 | 2020-02-15 | 2020-02-01 | 24 |
| Date_of_progression_to_severe_state | 90 | 0.24 | 2020-01-19 | 2020-02-15 | 2020-02-02 | 18 |
| Sampling_date_for_metabolomics | 53 | 0.55 | 2020-01-19 | 2020-02-17 | 2020-02-02 | 22 |
| Sampling_date_for_proteomics | 58 | 0.51 | 2020-01-24 | 2020-02-17 | 2020-02-03 | 18 |
| Discharge_Date | 53 | 0.55 | 2020-02-02 | 2020-03-11 | 2020-02-21 | 26 |
| Test_date…18 | 28 | 0.76 | 2020-01-20 | 2020-02-16 | 2020-02-02 | 24 |
| Test_date…23 | 28 | 0.76 | 2020-01-20 | 2020-02-16 | 2020-02-02 | 25 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Sex_g | 0 | 1 | FALSE | 2 | Mal: 77, Fem: 41 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Group_d | 0 | 1.00 | 1.55 | 1.10 | 0.00 | 1.00 | 2.00 | 2.00 | 3.00 | ▆▆▁▇▆ |
| Age_year | 0 | 1.00 | 47.42 | 12.82 | 18.00 | 38.00 | 47.00 | 56.00 | 77.00 | ▂▇▇▇▂ |
| BMI_h | 8 | 0.93 | 24.52 | 2.88 | 18.94 | 22.49 | 24.43 | 26.64 | 32.87 | ▅▇▇▂▁ |
| WBC_count,_×109/L | 28 | 0.76 | 6.92 | 3.55 | 1.90 | 4.97 | 6.11 | 7.72 | 24.20 | ▇▅▁▁▁ |
| Lymphocyte_count,_×109/L | 28 | 0.76 | 1.26 | 0.69 | 0.20 | 0.73 | 1.18 | 1.60 | 4.10 | ▇▇▂▁▁ |
| Monocyte_count,_×109/L | 28 | 0.76 | 0.48 | 0.20 | 0.10 | 0.33 | 0.45 | 0.60 | 1.24 | ▅▇▅▁▁ |
| Platelet_count,_×109/L | 28 | 0.76 | 195.61 | 57.94 | 37.00 | 158.50 | 199.50 | 228.75 | 334.00 | ▁▃▇▃▂ |
| ALT_j,__U/L | 28 | 0.76 | 38.62 | 86.17 | 7.00 | 16.40 | 23.55 | 37.00 | 824.00 | ▇▁▁▁▁ |
| AST_k,_U/L | 28 | 0.76 | 41.86 | 80.07 | 10.00 | 21.00 | 26.00 | 35.00 | 743.00 | ▇▁▁▁▁ |
| GGT_l,_U/L | 28 | 0.76 | 39.59 | 36.82 | 9.00 | 18.25 | 28.50 | 48.75 | 264.00 | ▇▁▁▁▁ |
| TBIL_m,_μmol/L | 28 | 0.76 | 20.31 | 43.76 | 3.40 | 9.67 | 13.40 | 18.80 | 403.40 | ▇▁▁▁▁ |
| DBIL_n,_μmol/L | 28 | 0.76 | 9.02 | 27.46 | 0.80 | 3.12 | 4.75 | 6.65 | 254.40 | ▇▁▁▁▁ |
| Creatinine,_μmol/L | 28 | 0.76 | 82.26 | 35.71 | 4.00 | 66.00 | 76.50 | 89.00 | 335.00 | ▅▇▁▁▁ |
| Glucose,_mmol/L | 28 | 0.76 | 7.48 | 3.65 | 3.65 | 5.43 | 6.26 | 8.09 | 25.63 | ▇▂▁▁▁ |
| time_onset_to_admission | 53 | 0.55 | 6.00 | 4.42 | 1.00 | 3.00 | 4.00 | 9.00 | 24.00 | ▇▃▁▁▁ |
| time_admission_to_severe | 90 | 0.24 | 2.39 | 1.66 | 0.00 | 1.00 | 2.00 | 3.25 | 7.00 | ▇▅▆▁▂ |
# Remove column with almost only missing data
df <- df %>% select(-`MSRep_ID_c`)
# --- PLOTS ---
# Set the order of the groups for the plots
df_expanded$Group <- factor(df_expanded$Group,
levels = c("Global", "healthy", "non_covid", "total_covid", "non_severe", "severe"))
# Plot 1: Sex by groups
ggplot(df_expanded, aes(x = Group, fill = Sex_g)) +
geom_bar(position = "fill") +
theme_minimal() +
labs(title = "Proportion of Sex by Group", y = "Proportion", x = "", fill = "Sex") +
scale_fill_manual(values = c("Female" = "#E69F00", "Male" = "#56B4E9")) +
scale_x_discrete(labels = c(
"Global" = "Global",
"healthy" = "Healthy",
"non_covid" = "Non-COVID-19",
"total_covid" = "Total COVID-19",
"non_severe" = "COVID-19 Non-severe",
"severe" = "COVID-19 Severe"
)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot 2: BMI by Groups
ggplot(df_expanded, aes(x = Group, y = BMI_h, fill = Group)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.5) +
theme_minimal() +
labs(
title = "Distribution of BMI by Patient Group",
x = "",
y = "BMI"
) +
scale_fill_brewer(palette = "Set2") +
scale_x_discrete(labels = c(
"Global" = "Global",
"healthy" = "Healthy",
"non_covid" = "Non-COVID-19",
"total_covid" = "Total COVID-19",
"non_severe" = "COVID-19 Non-severe",
"severe" = "COVID-19 Severe"
)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot 3: Age by Groups
ggplot(df_expanded, aes(x = Group, y = Age_year, fill = Group)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.5) +
theme_minimal() +
labs(
title = "Distribution of Age by Patient Group",
x = "",
y = "Age"
) +
scale_fill_brewer(palette = "Set2") +
scale_x_discrete(labels = c(
"Global" = "Global",
"healthy" = "Healthy",
"non_covid" = "Non-COVID-19",
"total_covid" = "Total COVID-19",
"non_severe" = "COVID-19 Non-severe",
"severe" = "COVID-19 Severe"
)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot 4: BMI by Sex
ggplot(df, aes(x = Sex_g, y = BMI_h, fill = Sex_g)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.5) +
theme_minimal() +
labs(
title = "Distribution of BMI by Sex",
x = "Sex",
y = "BMI",
fill = ""
) +
scale_fill_manual(values = c("Female" = "#E69F00", "Male" = "#56B4E9"))
# Plot 5: Age by Sex
ggplot(df, aes(x = Sex_g, y = Age_year, fill = Sex_g)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.5) +
theme_minimal() +
labs(
title = "Distribution of Age by Sex",
x = "Sex",
y = "Age",
fill = ""
) +
scale_fill_manual(values = c("Female" = "#E69F00", "Male" = "#56B4E9"))
# Plot 6: Correlation Plot between Metabolomics columns
# Metabolomic columns indices
target_cols_corr <- c(18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30)
# Ensure columns are numeric; handle errors if columns contain non-numeric data, clean the variable names
df_corr <- df[, target_cols_corr]
current_names <- names(df_corr)
clean_names <- gsub(",.*", "", current_names)
clean_names <- gsub("_[a-zA-Z]$", "", clean_names)
clean_names <- gsub("_", " ", clean_names)
names(df_corr) <- clean_names
df_corr[] <- lapply(df_corr, function(x) as.numeric(as.character(x)))
# Calculate the correlation matrix
res <- cor(df_corr, use = "complete.obs")
# Generate the plot
corrplot(res,
type = "upper",
order = "hclust",
tl.col = "black",
tl.srt = 45,
tl.cex = 0.7,
diag = FALSE,
title = "Correlation of Clinical and Metabolomic Variables",
mar = c(0, 0, 2, 0)
)
> Discussion
To begin with, we noticed that missing data were labeled either with an empty space or with a “/”. However, those that contained an empty space were for a reason. For example, the patients from the healthy group did not have the columns corresponding to the metabolomics variables, as they were not tested. Also, some of the dates were not present for patients that, for example, did not become severe. On the other hand, the columns corresponding to the BMI and the CRP were the ones to contain some missing values labeled as “/”. These were treated as NA, therefore ignored when doing the plots in this section. Moreover, some other columns also included missing values (“/”), but they were not relevant columns for the statistics or visualizations.
Regarding the plots, some patterns could be detected. Firstly, the male proportion is higher across the population, but the female proportion increases for the non-healthy groups and even more for the severe COVID-19 group. Secondly, the BMI is higher for the severe group, as it could be expected. Considering the age, it is higher for the non-healthy groups than for the healthy group, and also for the severe COVID-19 group than for the non-severe, also as could be expected. Finally, regarding the BMI and age compared to sex, both are slightly higher for males, but with a little difference. To end up, the correlation matrix shows a very high positive correlation between total bilirubin and direct bilirubin, and also a quite high positive correlation of aspartate aminotransferase with alanine aminotransferase and C-reactive protein. Some more positive and negative correlations are observed between other variables, but not with that much strength.
Reproduce Supplementary Figure 1 using the data without modifying or removing any observations. Use the same variables and groupings as in the original figure.
In this part, we just produced the boxplots for the 12 metabolomics variables without modifying any data nor trying the obtain nicer plots. For this reason, most of the plots appear ugly and not interpretable. This was handled later in exercise 2.2.
# Define indexes and names
target_cols_plots <- c(19, 20, 21, 22, 24, 25, 26, 27, 28, 29, 30, 31)
target_colnames <- colnames(df_expanded)[target_cols_plots]
# Handle labels for plots
clean_titles <- c(
"White blood cell (WBC)",
"Lymphocyte",
"Monocyte",
"Platelet",
"C-reactive protein (CRP)",
"Alanine aminotransferase (ALT)",
"Aspartate aminotransferase (AST)",
"Glutamyltransferase (GGT)",
"Total bilirubin (TBIL)",
"Direct bilirubin (DBIL)",
"Creatinine",
"Glucose"
)
clean_units <- list(
expression(Count ~ (10^9/L)), # WBC (Implicit multiplication) or use (times ~ 10^9/L)
expression(Count ~ (10^9/L)), # Lymphocyte
expression(Count ~ (10^9/L)), # Monocyte
expression(Count ~ (10^9/L)), # Platelet
"mg/L", # CRP
"U/L", # ALT
"U/L", # AST
"U/L", # GGT
expression(mu * mol/L), # TBIL (µmol/L)
expression(mu * mol/L), # DBIL
expression(mu * mol/L), # Creatinine
"mmol/L" # Glucose
)
# Filter data frame and order groups
df_filtered <- df_expanded %>%
filter(Group %in% c("non_covid", "non_severe", "severe")) %>%
mutate(
Group = factor(Group, levels = c("non_covid", "non_severe", "severe")),
`CRP_i,_mg/L` = as.numeric(gsub("[^0-9.]", "", as.character(`CRP_i,_mg/L`)))
)
# Create plots list
plot_list <- list()
for (i in seq_along(target_cols_plots)) {
col_idx <- target_cols_plots[i]
colname <- colnames(df_expanded)[col_idx]
my_title <- clean_titles[i]
my_unit <- clean_units[[i]]
p <- ggplot(df_filtered, aes(x = Group, y = .data[[colname]], fill = Group)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.4, size = 1) +
theme_minimal() +
scale_y_continuous() +
labs(
title = my_title,
x = NULL,
y = my_unit
) +
scale_fill_manual(values = c("non_covid" = "#66c2a5",
"non_severe" = "#fc8d62",
"severe" = "#8da0cb")) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none",
plot.title = element_text(size = 10, face = "bold")
)
plot_list[[colname]] <- p
}
# Organize and show the 12 plots
wrap_plots(plot_list, ncol = 3, nrow = 4) +
plot_annotation(
title = "Comparison of Clinical and Metabolomic Variables by Severity",
subtitle = "Groups: Non-COVID, Non-Severe and Severe"
)
Identify and handle outliers using an appropriate and well-justified method (e.g. removal, transformation, or robust statistics).
Reproduce Supplementary Figure 1 after outlier handling and discuss how and why the results change.
In this second part of the second exercise, we tried different methods of handling outliers to improve the previous plots. The results with these methods are discussed at the end of the exercise
# Method 1: hide extreme values
df_filtered <- df_expanded %>%
filter(Group %in% c("non_covid", "non_severe", "severe")) %>%
mutate(Group = factor(Group, levels = c("non_covid", "non_severe", "severe"))) %>%
# Force specific columns to be numeric
mutate(across(all_of(colnames(df_expanded)[target_cols_plots]), ~ as.numeric(as.character(.x))))
# Generate plots list
plot_list <- list()
for (i in seq_along(target_cols_plots)) {
col_idx <- target_cols_plots[i]
colname <- colnames(df_expanded)[col_idx]
my_title <- clean_titles[i]
my_unit <- clean_units[[i]]
# Calculate the 95th percentile for the Upper Limit
# We use this to "zoom in" and hide the top 5% extreme values
upper_limit <- quantile(df_filtered[[colname]], 0.95, na.rm = TRUE)
p <- ggplot(df_filtered, aes(x = Group, y = .data[[colname]], fill = Group)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.4, size = 1) +
coord_cartesian(ylim = c(0, upper_limit)) +
theme_minimal() +
labs(
title = my_title,
x = NULL,
y = my_unit
) +
scale_fill_manual(values = c("non_covid" = "#66c2a5",
"non_severe" = "#fc8d62",
"severe" = "#8da0cb")) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none",
plot.title = element_text(size = 10, face = "bold")
)
plot_list[[colname]] <- p
}
# Organize and show the 12 plots
wrap_plots(plot_list, ncol = 3, nrow = 4) +
plot_annotation(
title = "Comparison of Metabolites (Extreme Values Clipped)",
subtitle = "View zoomed to 95th percentile; top 5% outliers hidden"
)
# Method 2: Hide outliers
# Create plots list
plot_list <- list()
for (i in seq_along(target_cols_plots)) {
col_idx <- target_cols_plots[i]
colname <- colnames(df_expanded)[col_idx]
my_title <- clean_titles[i]
my_unit <- clean_units[[i]]
# Calculate Whiskers per group
outlier_bounds <- df_filtered %>%
group_by(Group) %>%
summarise(
Q1 = quantile(.data[[colname]], 0.25, na.rm = TRUE),
Q3 = quantile(.data[[colname]], 0.75, na.rm = TRUE),
IQR_val = Q3 - Q1,
lower_bound = Q1 - 1.5 * IQR_val,
upper_bound = Q3 + 1.5 * IQR_val,
.groups = "drop"
)
# Filter Points for Jitter
# Join the bounds to the data and keep only points within the bounds
df_points_filtered <- df_filtered %>%
left_join(outlier_bounds, by = "Group") %>%
filter(.data[[colname]] >= lower_bound & .data[[colname]] <= upper_bound)
plot_min <- min(df_points_filtered[[colname]], na.rm = TRUE)
plot_max <- max(df_points_filtered[[colname]], na.rm = TRUE)
plot_min <- max(0, plot_min * 0.95)
plot_max <- plot_max * 1.05
p <- ggplot(df_filtered, aes(x = Group, y = .data[[colname]], fill = Group)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
geom_jitter(data = df_points_filtered,
width = 0.2, alpha = 0.4, size = 1) +
coord_cartesian(ylim = c(plot_min, plot_max)) +
theme_minimal() +
labs(
title = my_title,
x = NULL,
y = my_unit
) +
scale_fill_manual(values = c("non_covid" = "#66c2a5",
"non_severe" = "#fc8d62",
"severe" = "#8da0cb")) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none",
plot.title = element_text(size = 10, face = "bold")
)
plot_list[[colname]] <- p
}
wrap_plots(plot_list, ncol = 3, nrow = 4) +
plot_annotation(
title = "Comparison of Metabolites (Outliers Removed)",
subtitle = "View restricted to statistical whiskers (Group-specific IQR)"
)
# Method 3: log-transformation
# Create plots list
plot_list <- list()
for (i in seq_along(target_cols_plots)) {
col_idx <- target_cols_plots[i]
colname <- colnames(df_expanded)[col_idx]
my_title <- clean_titles[i]
my_unit <- clean_units[[i]]
p <- ggplot(df_filtered, aes(x = Group, y = .data[[colname]], fill = Group)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.4, size = 1) +
# Use pseudo_log to handle any potential 0s safely without error
scale_y_continuous(trans = "pseudo_log") +
theme_minimal() +
labs(
title = my_title,
x = NULL,
y = my_unit
) +
scale_fill_manual(values = c("non_covid" = "#66c2a5",
"non_severe" = "#fc8d62",
"severe" = "#8da0cb")) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none",
plot.title = element_text(size = 10, face = "bold")
)
plot_list[[colname]] <- p
}
wrap_plots(plot_list, ncol = 3, nrow = 4) +
plot_annotation(
title = "Comparison of Metabolites (Log-Transformed)",
subtitle = "Outliers handled via Pseudo-Log Transformation"
)
# Method 4: log-transformation + hiding extreme values
plot_list <- list()
for (i in seq_along(target_cols_plots)) {
col_idx <- target_cols_plots[i]
colname <- colnames(df_expanded)[col_idx]
my_title <- clean_titles[i]
my_unit <- clean_units[[i]]
lower_limit <- quantile(df_filtered[[colname]], 0.02, na.rm = TRUE)
upper_limit <- quantile(df_filtered[[colname]], 0.98, na.rm = TRUE)
if (lower_limit < 0) lower_limit <- 0
p <- ggplot(df_filtered, aes(x = Group, y = .data[[colname]], fill = Group)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.4, size = 1) +
scale_y_continuous(trans = "pseudo_log", n.breaks = 4) +
coord_cartesian(ylim = c(lower_limit * 0.9, upper_limit * 1.1)) +
theme_minimal() +
labs(
title = my_title,
x = NULL,
y = my_unit
) +
scale_fill_manual(values = c("non_covid" = "#66c2a5",
"non_severe" = "#fc8d62",
"severe" = "#8da0cb")) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none",
plot.title = element_text(size = 10, face = "bold"),
axis.text.y = element_text(size = 7)
)
plot_list[[colname]] <- p
}
wrap_plots(plot_list, ncol = 3, nrow = 4) +
plot_annotation(
title = "Comparison of Metabolites (Log-Transformed + Clipped)",
subtitle = "Zoomed to 2nd-98th percentiles"
)
Discussion
Two different strategies were used: clipping and log-transforming. Method 1 just consisted of zooming in by hiding the most extreme values (top 5% outliers) and plotting only the rest of the values. The second method was based on the same, but restrincting the plots to only the statistical whiskers, therefore completely removing the statistical outliers. Next, the third method consisted on a log-transformation, which allowed to see the axis in a different scale, better for the visualization. Finally the fourth method combined both the log-transformation and the clipping, hiding the top 2% outliers. We considered all these methods to be better than the way of cutting the axis used in the paper, which resulted in a difficult way to interpret the plots.
Considering only the visualization aspect, Method 2 (completely removing outliers), resulted in the best one, as the interpretability aspect of the boxplots was better. However, this method resulted in missing so many values which were real measurements. If we were calculating statistics, maybe removing these outliers could be a good option, but in this case, the plots are hiding real values. On the other hand, the log-transformation allowed to keep all the real values in the plots, but resulted in some plots not being readable.
Other methods such as capping, or modifying in any way the values of the outliers were not considered, because it would mean showing in the plots values which were not real.
Create a heatmap of the biomarker data. Include group and gender as annotation variables.
Finally, exercise 3 produced the heatmap of the biomarker data with severity group and gender as annotation variables. Scaling of the data was required for the plot.
# Prepare Data Matrix
# Select only the 12 target numeric columns
heatmap_data <- df_filtered %>%
select(all_of(target_colnames)) %>%
as.matrix()
# Rename Columns (Variables) to Clean Titles
# We use the 'clean_titles' vector we defined in Exercise 2.2
colnames(heatmap_data) <- clean_titles
# Scale and Transpose
heatmap_matrix <- t(scale(heatmap_data))
# Create a data frame for the column annotations
annotation_info <- df_filtered %>%
select(Group, Sex_g) %>%
as.data.frame()
# Rename the columns for the Legend
colnames(annotation_info) <- c("Group", "Sex")
# Rename the Factor Levels for "Group" to be descriptive
annotation_info$Group <- factor(annotation_info$Group,
levels = c("non_covid", "non_severe", "severe"),
labels = c("Non-COVID-19", "COVID-19 Non-severe", "COVID-19 Severe"))
# Ensure row names of annotation match column names of the matrix
rownames(annotation_info) <- paste0("P", 1:ncol(heatmap_matrix))
colnames(heatmap_matrix) <- paste0("P", 1:ncol(heatmap_matrix))
ann_colors <- list(
Group = c("Non-COVID-19" = "#66c2a5",
"COVID-19 Non-severe" = "#fc8d62",
"COVID-19 Severe" = "#8da0cb"),
Sex = c("Female" = "#E69F00",
"Male" = "#56B4E9")
)
# Draw the Heatmap
library(pheatmap)
pheatmap(heatmap_matrix,
cluster_rows = TRUE, # Cluster metabolites
cluster_cols = TRUE, # Cluster patients
annotation_col = annotation_info,
annotation_colors = ann_colors,
show_colnames = FALSE, # Hide messy patient IDs
show_rownames = TRUE, # Show the clean metabolite names
fontsize_row = 10, # Adjust font size if needed
color = colorRampPalette(c("navy", "white", "firebrick3"))(100),
main = "Biomarker Heatmap: Metabolites by gender and severity group"
)
Write a concise conclusion (1 paragraphs) summarizing the main differences observed across the three patient groups based on the twelve clinical parameters. Support your statements with evidence from your analyses.
The analysis performed identified different clinical profiles distinguishing the three tested patient groups. The Non-COVID-19 group is characterized by the highest level of liver function markers (ALT, TBIL, and DBIL), as observed in the heatmap, alongside with elevated WBC counts shown in the boxplots. Within the COVID-19 positive groups, severity is marked by elevated Glucose, CRP, GGT and Bilirubin levers compared to the Non-severe group. On the other hand, non-severe patients retain higher Platelet and Lymphocyte counts, shown in the boxplots, suggesting that progression to severity is clinically defined by lymphopenia and thrombocytopenia in combination with hyperglycaemia and inflammation.
R version 4.5.2 (2025-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.4 LTS
Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
locale: [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C
[3] LC_TIME=es_US.UTF-8 LC_COLLATE=es_ES.UTF-8
[5] LC_MONETARY=es_US.UTF-8 LC_MESSAGES=es_ES.UTF-8
[7] LC_PAPER=es_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Madrid tzcode source: system (glibc)
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] skimr_2.2.2 scales_1.4.0 pheatmap_1.0.13
patchwork_1.3.2 [5] kableExtra_1.4.0 knitr_1.51 corrplot_0.95
tidyr_1.3.2
[9] ggplot2_4.0.2 dplyr_1.2.0 readxl_1.4.5
loaded via a namespace (and not attached): [1] sass_0.4.10
generics_0.1.4 xml2_1.5.2 stringi_1.8.7
[5] digest_0.6.39 magrittr_2.0.4 evaluate_1.0.5 grid_4.5.2
[9] RColorBrewer_1.1-3 fastmap_1.2.0 cellranger_1.1.0
jsonlite_2.0.0
[13] purrr_1.2.1 viridisLite_0.4.3 textshaping_1.0.4
jquerylib_0.1.4
[17] cli_3.6.5 rlang_1.1.7 base64enc_0.1-6 withr_3.0.2
[21] repr_1.1.7 cachem_1.1.0 yaml_2.3.12 otel_0.2.0
[25] tools_4.5.2 vctrs_0.7.1 R6_2.6.1 lifecycle_1.0.5
[29] stringr_1.6.0 pkgconfig_2.0.3 pillar_1.11.1 bslib_0.10.0
[33] gtable_0.3.6 glue_1.8.0 systemfonts_1.3.1 xfun_0.56
[37] tibble_3.3.1 tidyselect_1.2.1 rstudioapi_0.18.0 farver_2.1.2
[41] htmltools_0.5.9 labeling_0.4.3 rmarkdown_2.30 svglite_2.2.2
[45] compiler_4.5.2 S7_0.2.1